WNV analysis

Analysis of the WNV dataset from https://raw.githubusercontent.com/jdrakephd/ideas-workshop/master/wnv.csv.

Let’s examine our data to start -

head(wnv)
## # A tibble: 6 x 9
##   State        Year EncephMen Fever Other Total Fatal Latitude Longitude
##   <chr>       <int>     <int> <int> <int> <int> <int>    <dbl>     <dbl>
## 1 New York     1999        59     3     0    62     7     42.5     -75.3
## 2 Connecticut  2000         0     1     0     1     0     41.5     -72.8
## 3 New Jersey   2000         5     1     0     6     1     40.2     -74.7
## 4 New York     2000        14     0     0    14     1     42.5     -75.3
## 5 Alabama      2001         2     0     0     2     1     32.3     -86.9
## 6 Connecticut  2001         6     0     0     6     1     41.5     -72.8
wnv %>% select(-State) %>% ggpairs(.)

Including Plots

How do the cases per year change?


Although hard to read, we can see a definite change in both cases per year and a difference in cases per state.

Neuroinvasive disease rates

WNV can cause meningitis, but not in all cases.

What is the neuroinvasive disease rate of WNV, and how does this differ by state?

#First, let us calculate the neuroinvasive disease rate:
wnv <- mutate(wnv, ndr = EncephMen / Total)

#Then, we define two functions to calculate the Standard Error (se) and Mean of the ndr:

#function for finding mean
fun_mean <- function(x) {
  l <- length(x)
  s <- sum(x)
  m <- s / l
  return(m)
}

#function for finding standard error
fun_se <- function (x){
  se <- sd(x) / sqrt(length(x))
  return(se)
}

#Then, we can plot the mean ndr per state and include the standard error:
wnv %>%
  group_by(State) %>%
  summarise(ndr_mean = fun_mean(ndr), ndr_se=fun_se(ndr)) %>%
  ggplot(aes(x = State, y = ndr_mean)) +
  geom_bar(stat = "identity") + 
  geom_errorbar(aes(ymin = ndr_mean - ndr_se, 
                   ymax = ndr_mean + ndr_se)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 2 rows containing missing values (geom_errorbar).

East vs West, North vs South - death rates

Is there a location effect on WNV death rates?

We can check!

To start, we’ll define the case fatality rate (CFR) -

wnv %>% 
  ungroup() %>%
  group_by(State, Year) %>% 
  summarise(CFR = (Fatal / Total) * 100) 
## # A tibble: 272 x 3
## # Groups:   State [?]
##    State    Year   CFR
##    <chr>   <int> <dbl>
##  1 Alabama  2001 50.0 
##  2 Alabama  2002  6.12
##  3 Alabama  2003  8.11
##  4 Alabama  2004  0.  
##  5 Alabama  2005 20.0 
##  6 Alabama  2006  0.  
##  7 Alabama  2007 12.5 
##  8 Arizona  2003  7.69
##  9 Arizona  2004  4.09
## 10 Arizona  2005  4.42
## # ... with 262 more rows
wnv$CFR <- (wnv$Fatal / wnv$Total) * 100

head(wnv)
## # A tibble: 6 x 11
##   State    Year EncephMen Fever Other Total Fatal Latitude Longitude   ndr
##   <chr>   <int>     <int> <int> <int> <int> <int>    <dbl>     <dbl> <dbl>
## 1 New Yo~  1999        59     3     0    62     7     42.5     -75.3 0.952
## 2 Connec~  2000         0     1     0     1     0     41.5     -72.8 0.   
## 3 New Je~  2000         5     1     0     6     1     40.2     -74.7 0.833
## 4 New Yo~  2000        14     0     0    14     1     42.5     -75.3 1.00 
## 5 Alabama  2001         2     0     0     2     1     32.3     -86.9 1.00 
## 6 Connec~  2001         6     0     0     6     1     41.5     -72.8 1.00 
## # ... with 1 more variable: CFR <dbl>

We’ll first look at the difference between Eastern and Western states. We’ll classify each state as either ‘East’ or ‘West’, depending on their location in relation to the mean latitude of the states.

wnv$east_west <- ifelse(wnv$Longitude < mean(wnv$Longitude), "West", "East")
table(wnv$east_west)
## 
## East West 
##  151  121

Then, we can group_by the state’s location, and test to see if there is a difference in rates:

wnv %>%
  group_by(east_west) %>%
  summarise(CFR_mean = fun_mean(CFR), CFR_se=fun_se(CFR)) 
## # A tibble: 2 x 3
##   east_west CFR_mean CFR_se
##   <chr>        <dbl>  <dbl>
## 1 East          7.28  0.766
## 2 West          3.91  0.354
#non-parametric test needed
hist(wnv$CFR)

wilcox.test(wnv$CFR ~ wnv$east_west)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  wnv$CFR by wnv$east_west
## W = 10530, p-value = 0.02803
## alternative hypothesis: true location shift is not equal to 0

What about North vs South?

ggplot(wnv, aes(x = Latitude, y = CFR)) +
         geom_point()

ggplot(wnv, aes(x = Latitude, y = CFR)) +
  geom_smooth(method = "gam")

summary(lm(CFR ~ Latitude, data = wnv))
## 
## Call:
## lm(formula = CFR ~ Latitude, data = wnv)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.898 -4.935 -2.031  2.102 60.867 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.59437    3.85366   3.528 0.000493 ***
## Latitude    -0.20065    0.09821  -2.043 0.042011 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.611 on 270 degrees of freedom
## Multiple R-squared:  0.01523,    Adjusted R-squared:  0.01158 
## F-statistic: 4.174 on 1 and 270 DF,  p-value: 0.04201

Interactive plots!

Let’s look at the North vs South plot in more detail -

## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`